Practical Issues In Neural Network Training - Difficulties in Convergence



Sufficiently fast convergence of the optimization process is difficult to achieve with very deep networks, because depth increases resistance to training in terms of letting gradients flow smoothly through the network. This problem is related to the vanishing gradient problem, but it has its own unique characteristics. Therefore, several "tricks" have been proposed in the literature for these cases, including the use of gating networks and residual networks. These methods will be discussed in the coming posts.
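As a rough illustration of the residual idea (this sketch is not from the original post), the following PyTorch snippet wraps a small block of layers with an identity skip connection so that gradients have a direct path through the network. The layer sizes and depth are arbitrary choices for demonstration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A block whose output is x + F(x), so gradients can flow
    through the identity path even when F's gradients are small."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection: identity plus learned residual

# Stacking many such blocks remains trainable, because the identity
# paths keep gradients from being squashed layer after layer.
deep_net = nn.Sequential(*[ResidualBlock(64) for _ in range(20)])
x = torch.randn(8, 64)
y = deep_net(x)  # forward pass through a 20-block network
```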

Local and Spurious Optima

The loss function of a neural network is highly nonlinear and typically has many local optima. When the parameter space is large and there are many local optima, it makes sense to spend some effort on picking good initialization points. One such method for improving neural network initialization is referred to as pretraining. The basic idea is to use either supervised or unsupervised training on shallow sub-networks of the original network in order to create the initial weights. This type of pretraining is done in a greedy, layer-wise fashion in which a single layer of the network is trained at a time in order to learn the initialization points of that layer. This type of approach provides initialization points that avoid drastically irrelevant parts of the parameter space to begin with. Furthermore, unsupervised pretraining often tends to avoid problems associated with overfitting. The basic idea here is that some of the minima in the loss function are spurious optima because they are exhibited only in the training data and not in the test data. Using unsupervised pretraining tends to move the initialization point closer to the basin of "good" optima in the test data. This is an issue associated with model generalization. Methods for pretraining will be discussed in the coming posts.
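The sketch below shows one possible form of greedy layer-wise pretraining, using unsupervised autoencoder-style training of one layer at a time in PyTorch. The layer sizes, optimizer, and reconstruction loss are illustrative assumptions rather than the specific methods referenced above.

```python
import torch
import torch.nn as nn

def pretrain_layers(layer_sizes, data, epochs=5, lr=1e-3):
    """Greedy layer-wise pretraining: train one layer at a time as an
    autoencoder on the output of the previously trained layers.
    The learned encoder weights then serve as initialization for the deep network."""
    encoders = []
    inputs = data
    for in_dim, out_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
        encoder = nn.Linear(in_dim, out_dim)
        decoder = nn.Linear(out_dim, in_dim)
        opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            reconstruction = decoder(torch.relu(encoder(inputs)))
            loss = nn.functional.mse_loss(reconstruction, inputs)
            loss.backward()
            opt.step()
        encoders.append(encoder)
        inputs = torch.relu(encoder(inputs)).detach()  # encoded data becomes the next layer's input
    return encoders  # use these weights to initialize the corresponding layers of the full network

# Example: pretrain a 784 -> 256 -> 64 encoder stack on random stand-in "data".
pretrained = pretrain_layers([784, 256, 64], torch.randn(128, 784))
```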

Interestingly, the notion of spurious optima is often viewed from the lens of model generalization in neural networks. This is a different perspective from traditional optimization. In traditional optimization, one does not focus on the differences in the loss functions of the training and test data, but on the shape of the loss function in only the training data. Surprisingly, the problem of local optima (from a traditional perspective) is a smaller issue in neural networks than one might normally expect from such a non-linear function. Most of the time, the non-linearity causes problems during the training process itself (e.g., failure to converge), rather than getting stuck in a local minimum.

Computational Challenges

A significant challenge in neural network design is the running time required to train the network. It is not uncommon to require weeks to train neural networks in the text and image domains. In recent years, advances in hardware technology such as Graphics Processing Units (GPUs) have helped to a significant extent. GPUs are specialized hardware processors that can significantly speed up the kinds of operations commonly used in neural networks. In this sense, some frameworks like Torch are particularly convenient because they have GPU support tightly integrated into the platform.
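As a minimal illustration of integrated GPU support, the snippet below (written in PyTorch, the Python-based successor of Torch) moves a small placeholder model and a batch of data onto a GPU when one is available; no other code changes are needed.

```python
import torch
import torch.nn as nn

# Use the GPU if one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
batch = torch.randn(64, 512, device=device)

# The same matrix multiplications now run on the GPU transparently.
logits = model(batch)
```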

Although algorithmic advancements have played a role in the recent excitement around deep learning, a lot of the gains have come from the fact that the same algorithms can do much more on modern hardware. Faster hardware also supports algorithmic development, because one needs to repeatedly test computationally intensive algorithms to understand what works and what does not. For example, a neural model such as the long short-term memory network has changed only modestly since it was first proposed in 1997. Yet, the potential of this model has been recognized only recently because of the advances in computational power of modern machines and the algorithmic tweaks associated with improved experimentation.

One convenient property of the vast majority of neural network models is that most of the computational heavy lifting is front-loaded into the training phase, while the prediction phase is often computationally efficient, because it requires only a modest number of operations (depending on the number of layers). This is important because the prediction phase is often far more time-critical than the training phase. For example, it is far more important to classify an image in real time (with a pre-built model), even though the actual building of that model might have required weeks of training over millions of images. Methods have also been designed to compress trained networks in order to enable their deployment in mobile and space-constrained settings. These issues will be discussed in the coming posts.
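As one simple example of the compression idea, the sketch below applies magnitude-based weight pruning using PyTorch's pruning utilities. The toy model and the 50% sparsity level are assumptions chosen for illustration, not the specific compression methods referred to in this post.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy model standing in for a large, already-trained network.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Zero out the 50% smallest-magnitude weights in each linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the pruning permanent

sparsity = (model[0].weight == 0).float().mean().item()
print(f"Fraction of zero weights in the first layer: {sparsity:.2f}")
```

The zeroed weights can then be stored in a sparse format or combined with quantization to shrink the model for deployment.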